
Statistics in Medicine

Wiley

Preprints posted in the last 90 days, ranked by how well they match Statistics in Medicine's content profile, based on 34 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.

1
Simulation-Based Comparison of Controlled Interrupted Time Series (CITS) and Multivariable Regression

Orwa, F. O.; Mutai, C.; Nizeyimana, I.; Mwangi, A.

2026-04-13 health policy 10.64898/2026.04.10.26350670 medRxiv
Top 0.1%
10.2%

When randomized controlled trials are impractical, interrupted time series designs offer a rigorous quasi-experimental approach to assess population-level policies. Indeed, among quasi-experimental designs (QEDs), the Interrupted Time Series (ITS) method is commonly regarded as the most robust. But interrupted time series designs are susceptible to serial correlation and to confounding by time-varying factors associated with both the intervention and the outcome, which may result in biased inference. We therefore provide a simulation-based comparison of controlled interrupted time series (CITS) and multivariable regression (multivariable negative binomial regression) for estimating policy effects in count time series data. These approaches are widely used in policy evaluations, yet their comparative performance in typical population health settings has rarely been examined directly. We tested both approaches across a variety of data-generating scenarios, varying the series length, intervention effect size, and magnitude of lag-1 autocorrelation. Performance was assessed in terms of bias, standard error calibration, confidence interval coverage, mean squared error, and statistical power. Both methods gave unbiased estimates for moderate and large intervention effects, although bias was more pronounced for small effects, particularly in short series. Although point estimate performance was similar, inferential properties differed substantially. CITS consistently had smaller mean squared error, better agreement between model-based and empirical standard errors, and confidence interval coverage near the nominal 95% level under weak to moderate autocorrelation. By contrast, multivariable regression was more sensitive to serial dependence, leading to underestimated standard errors and undercoverage, especially at moderate to high autocorrelation, even with Newey-West adjustments. These findings show the benefits of using a concurrent control series and the importance of structurally accounting for serial correlation when studying population-level policies with time series data.
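A minimal sketch (not the authors' code; all parameter values assumed) of the kind of setup the abstract describes: negative binomial counts with lag-1 autocorrelation, a level shift at the intervention, and a segmented-regression fit whose model-based standard error ignores the serial dependence, which is the failure mode the paper studies.

```python
# Sketch of the data-generating process and a naive segmented NB regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, t0, phi = 120, 60, 0.4          # assumed series length, break point, AR(1) coefficient
effect = np.log(0.8)               # assumed intervention effect on the log scale

# Latent AR(1) noise induces lag-1 serial correlation in the counts.
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = phi * eps[t - 1] + rng.normal(scale=0.2)

time = np.arange(n)
post = (time >= t0).astype(float)
mu = np.exp(3.0 + 0.002 * time + effect * post + eps)
y = rng.negative_binomial(n=10, p=10 / (10 + mu))   # NB counts; dispersion assumed

# Segmented regression: intercept, trend, and a level change at the intervention.
X = sm.add_constant(np.column_stack([time, post]))
fit = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=0.1)).fit()
print(fit.params[2], fit.bse[2])   # level-change estimate; SE is likely too small here
```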

2
Bias and Variance of Adjusting for Instruments

Hripcsak, G.; Anand, T.; Chen, H. Y.; Zhang, L.; Chen, Y.; Suchard, M. A.; Ryan, P. B.; Schuemie, M. J.

2026-03-15 epidemiology 10.64898/2026.03.13.26348328 medRxiv
Top 0.1%
9.9%

Propensity score adjustment is commonly used in observational research to address confounding. Controversy persists about how to select covariates as possible confounders when building the propensity model. A desire to include all possible confounders is offset by a concern that more covariates will increase bias or variance. Much of the concern is over instruments: variables that affect the treatment but not the outcome. Adjusting for an instrument has been shown to increase bias due to unadjusted confounding and to increase the variance of the effect estimate. Large-scale propensity score (LSPS) adjustment includes most available pre-treatment covariates in its propensity model. It addresses instruments with a pair of diagnostics, halting the analysis if any covariate exceeds a correlation coefficient of 0.5 with the treatment and checking for an aggregation of instruments using equipoise, reported as a preference score. Our simulation assesses the impact of adjusting for instruments in the context of LSPS's diagnostics. In our simulation, even when the variance of the treatment contributed by the adjusted instrument(s) exceeded that of an unadjusted confounder by more than twenty-fold, as long as the correlation between the instrument(s) and the treatment was less than 0.5 and the equipoise was greater than 0.5, the additional shift in the effect estimate due to adjusting for the instrument(s) was less than the shift due to the confounding itself. We therefore find in this simulation that adjusting for instruments contributed only a minor amount of bias to the effect estimate. This result aligns well with a previous assessment of the impact of adjusting for instruments and with separate empirical evidence that adjusting for many covariates outperforms attempts to identify a limited set of confounders.
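A minimal sketch (assumptions, not the authors' implementation) of the two LSPS diagnostics the abstract names: a per-covariate correlation check against treatment, and equipoise computed from the preference score. The 0.3-0.7 preference-score band for equipoise is the convention of Walker et al.; the transformation from propensity to preference score is the standard one.

```python
import numpy as np

def correlation_check(X, treat, threshold=0.5):
    """Flag covariate columns whose correlation with treatment exceeds the threshold."""
    r = np.array([np.corrcoef(X[:, j], treat)[0, 1] for j in range(X.shape[1])])
    return np.where(np.abs(r) > threshold)[0]

def equipoise(ps, treat, lo=0.3, hi=0.7):
    """Share of subjects with preference score in [lo, hi].

    The preference score rescales the propensity score ps by treatment
    prevalence: logit(F) = logit(ps) - logit(prevalence).
    """
    prev = treat.mean()
    logit = lambda p: np.log(p / (1 - p))
    f = 1 / (1 + np.exp(-(logit(ps) - logit(prev))))
    return np.mean((f >= lo) & (f <= hi))
```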

3
Regression-based Modeling of Spearman's Rho for Longitudinal Metabolomics and Mental Wellness in Breast Cancer Patients

Chen, Y.; Gui, T.; Huang, Z.; Quach, N.; Tu, S.; Liu, J.; Garrett, T. J.; Starkweather, A. R.; Lyon, D. E.; Shepherd, B. E.; Tu, X. M.; Lin, T.

2026-04-16 cancer biology 10.64898/2026.04.13.718341 medRxiv
Top 0.1%
9.2%

Chemotherapy in breast cancer (BC) can substantially affect mental wellness. Advances in metabolomics enable comprehensive profiling of metabolic changes over time during and after treatment, offering insights into biological mechanisms linking chemotherapy to mental health outcomes. To study the association between metabolite profiles and mental wellness, correlation-based analyses are particularly useful. Spearman's rho is a widely used correlation measure and a popular alternative to Pearson's correlation, since it also applies to non-linear associations between variables. However, existing methods are not designed for longitudinal data and do not allow for covariate adjustment. In this paper, we propose a novel regression-based framework grounded in a class of semiparametric models, the functional response models, to extend this popular correlation measure to longitudinal settings with missing data under the missing at random assumption. This framework facilitates inference about changes in correlations over time and the association of explanatory variables with such changes. We use simulation studies to evaluate the performance of the approach with moderate sample sizes. We apply the approach to a one-year longitudinal substudy of the EPIGEN study to examine the longitudinal association between metabolite profiles and mental wellness in BC patients undergoing chemotherapy. The identified metabolites may serve as candidates for future in-depth bioinformatics analyses and translational investigations.
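A toy sketch (not the proposed functional response model estimator, which handles covariates and missingness jointly) of the target quantity: Spearman's rho between a metabolite and a wellness score at each visit, using available pairs only.

```python
import numpy as np
from scipy.stats import spearmanr

def rho_by_visit(metabolite, wellness, visit):
    """metabolite, wellness, visit: 1-d arrays over subject-visit records."""
    out = {}
    for t in np.unique(visit):
        m, w = metabolite[visit == t], wellness[visit == t]
        keep = ~(np.isnan(m) | np.isnan(w))        # complete pairs at this visit
        rho, _ = spearmanr(m[keep], w[keep])
        out[t] = rho
    return out
```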

4
Classification of Adolescent Drinking via Behavioral, Biological, and Environmental Features: A Machine Learning Approach with Bias Control

Liu, R.; Azzam, M.; Zabik, N.; Wan, S.; Blackford, J.; Wang, J.

2026-02-26 addiction medicine 10.64898/2026.02.24.26347002 medRxiv
Top 0.1%
5.0%

In 2024, approximately 30% of U.S. adolescents reported having consumed alcohol at least once in their lifetime, with about 25% of these individuals engaging in binge drinking. Adolescent alcohol use is associated with neurodevelopmental impairments, elevated risk of later alcohol use, and mental health disorders. These findings underscore the importance of identifying the variables driving adolescent alcohol use and leveraging them for early identification and targeted intervention. Previous studies have typically developed machine-learning classification models that use neuroimaging data in combination with limited clinical measurements. Neuroimaging data are expensive and difficult to obtain at scale, whereas clinical measures are more practical for large-scale screening due to their low cost and widespread accessibility. However, clinical-only approaches for alcohol drinking classification remain largely underexplored. Furthermore, prior studies have often focused on adults, limiting generalizability to the broader adolescent population. Additionally, confounding factors such as age and substance use, which are strongly correlated with alcohol consumption, have often been inadequately addressed, potentially inflating classification performance. Finally, class imbalance remains a persistent challenge, with prior attempts yielding only limited improvements. To address these limitations, we propose FocalTab, a framework that integrates TabPFN with focal loss for robust generalization and effective mitigation of class imbalance. The approach also incorporates an initial preprocessing step that removes the confounding effects of age and substance use. We compare FocalTab against state-of-the-art methods across different variable selections and dataset settings. FocalTab achieves the highest accuracy (84.3%) and specificity (80.0%) in the most stringent setting, in which both age and substance use variables were excluded, whereas competing models drop to near-chance specificity (12-24%). We further applied SHapley Additive exPlanations (SHAP) analysis to identify key clinical predictors of drinker classification, supporting enhanced screening and early intervention.
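A minimal sketch of the focal-loss ingredient (the standard binary form of Lin et al. 2017); how FocalTab couples it to TabPFN is the paper's contribution and is not reproduced here.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-9):
    """Binary focal loss: down-weights well-classified examples so that the
    rare (drinker) class contributes more to the average loss."""
    p_t = np.where(y == 1, p, 1 - p)          # probability assigned to the true class
    a_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return float(np.mean(-a_t * (1 - p_t) ** gamma * np.log(p_t + eps)))

# With gamma=0 and alpha=0.5 this reduces (up to a constant) to cross-entropy.
print(focal_loss(np.array([0.9, 0.2]), np.array([1, 0])))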

5
Using Negative Control Outcomes to Detect Selection Bias in Mendelian Randomization Studies

Gkatzionis, A.; Davey Smith, G.; Tilling, K.

2026-02-01 epidemiology 10.64898/2026.01.30.26345215 medRxiv
Top 0.1%
4.9%

Mendelian randomization is most commonly implemented using genetic variants as instrumental variables to investigate the causal effect of an exposure on an outcome of interest. Mendelian randomization studies are robust to confounding bias and reverse causation, but they remain susceptible to selection bias; for example, this can happen if the exposure or outcome is associated with selection into the study sample. Negative controls are sometimes used to detect biases (typically due to confounding) in observational studies. Here, we focus specifically on Mendelian randomization analyses and discuss under what conditions a variable can be used as a negative control outcome to detect selection mechanisms that could bias Mendelian randomization estimates. We show that the main requirement is that the negative control outcome relates to confounders of the exposure and outcome. Counter-intuitively, the effect of the negative control on selection is of secondary concern; for example, a variable that does not affect selection can be a valid negative control for an outcome that does. We also investigate under what conditions age and sex can be used as negative control outcomes in Mendelian randomization analyses. In a real-data application, we investigate the pairwise causal relationships between 19 traits, utilizing data from the UK Biobank. Treating biological sex as a negative control outcome, we identify selection bias in analyses involving commonly used traits such as alcohol consumption, body mass index and educational attainment.
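A simulation sketch (all effect sizes assumed, not taken from the paper) of the diagnostic idea: when selection depends on the exposure, conditioning on selection opens a path from the instrument through the confounder to a confounder-related negative control outcome, so a nonzero instrument-NCO association flags selection bias.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200_000
g = rng.binomial(2, 0.3, n)                    # genetic instrument
u = rng.normal(size=n)                         # confounder of exposure and outcome
x = 0.3 * g + 0.5 * u + rng.normal(size=n)     # exposure
nco = 0.5 * u + rng.normal(size=n)             # negative control outcome, related to U

sel = rng.random(n) < 1 / (1 + np.exp(-(x - 1)))   # selection depends on the exposure

for label, keep in [("full sample", np.ones(n, bool)), ("selected  ", sel)]:
    fit = sm.OLS(nco[keep], sm.add_constant(g[keep])).fit()
    print(label, round(fit.params[1], 4), round(fit.pvalues[1], 4))
# G is unrelated to the NCO in the full sample; an association appearing in the
# selected sample is the signature of selection bias.
```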

6
An E-value-Informed Sensitivity Analysis Framework for Hybrid Controlled Trials

Liu, C.; Mayer, M.; Lactaoen, K.; Gomez, L.; Weissman, G.; Hubbard, R.

2026-03-06 epidemiology 10.64898/2026.03.05.26347653 medRxiv
Top 0.1%
4.9%

Hybrid controlled trials (HCTs) incorporate real-world data into randomized controlled trials (RCTs) by augmenting the internal control arm with patients receiving the same treatment in routine care. Beyond increasing power, HCTs may improve recruitment by supporting unequal randomization ratios that increase patient access to experimental treatments. However, HCT validity is threatened by bias from unmeasured confounding due to lack of randomization of external controls, leading to outcome non-exchangeability between internal and external control patients. To address this challenge, we developed a sensitivity analysis framework to assess the robustness of HCT results to potential unmeasured confounding. We propose a tipping point analysis that adapts the E-value framework to the HCT setting where trial participation rather than treatment assignment is subject to confounding. To aid interpretation, we also introduce a data-driven benchmark representing the strength of unmeasured confounding reflected by the observed outcome non-exchangeability. We then propose an operational decision rule and evaluate its performance through simulation studies. Finally, we illustrate the approach using an asthma trial augmented by data from electronic health records. Simulation results demonstrate that our decision rule safeguards against Type I error inflation while preserving the power gains achieved by incorporating external data. In settings where moderate unmeasured confounding led to poorer outcomes for external controls, Type I error was controlled near the nominal 5% level, and power increased by 10-20% compared with analyses using RCT data alone. Our approach provides a practical, interpretable method to assess HCT robustness, supporting rigorous inference when integrating external real-world data.
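For orientation, a minimal sketch of the standard E-value formula (VanderWeele and Ding, 2017) that the framework adapts; the paper's HCT-specific twist, applying it to confounding of trial participation together with a data-driven benchmark and decision rule, is not reproduced here.

```python
import math

def e_value(rr):
    """E-value for a risk ratio: the minimum strength of association an
    unmeasured confounder would need with both treatment and outcome to
    explain the estimate away. Risk ratios below 1 are inverted first."""
    rr = 1 / rr if rr < 1 else rr
    return rr + math.sqrt(rr * (rr - 1))

print(e_value(1.8))  # 3.0: confounding of roughly this strength could explain RR = 1.8
```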

7
Causal estimands and target trials for the effect of lag time to treatment of cancer patients

Goncalves, B. P.; Franco, E. L.

2026-04-08 epidemiology 10.64898/2026.04.07.26350338 medRxiv
Top 0.1%
4.8%

Timeliness of therapy initiation is a fundamental determinant of outcomes for many medical conditions, perhaps most importantly cancer. Yet inefficiencies in healthcare systems mean that delays between diagnosis and treatment often adversely affect clinical outcomes for cancer patients. Although estimates of the effects of lag time to therapy would be informative to policymakers considering resource allocation to minimize delays in oncology, causal methods are seldom explicitly discussed in epidemiologic analyses of these lag times. Here, we propose causal estimands for such studies, and outline the protocol of a target trial that could be emulated with observational data on lag times. To illustrate the application of this approach, we simulate studies of lag time to treatment under two scenarios: one in which indication bias (the Waiting Time Paradox) is present and another in which it is absent. Although our discussion focuses on oncologic outcomes, components of the proposed target trial could be adapted to study delays for other medical conditions. We believe that the clarity with which causal questions are posed under the target trial emulation framework would lead to improved quantification of the effects of lag times in oncology, and hence to better informed policy decisions.
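A toy simulation (parameters assumed, not from the paper) of the indication bias the abstract calls the Waiting Time Paradox: sicker patients are treated sooner and die earlier, so short lag times look spuriously harmful in a naive comparison even when lag has no causal effect at all.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
severity = rng.normal(size=n)
# Sicker patients get treated faster (shorter lag, in days).
lag = np.maximum(rng.exponential(30, n) * np.exp(-0.5 * severity), 1)
# Mortality is driven by severity only; lag plays no causal role here.
death = rng.random(n) < 1 / (1 + np.exp(-(severity - 2)))

short = lag < np.median(lag)
print("mortality, short lag:", death[short].mean())
print("mortality, long  lag:", death[~short].mean())
# The short-lag group shows higher mortality purely through indication bias.
```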

8
Incorporating Uncertainty in Study Participants' Age in Serocatalytic Models

Chen, J.; Lambe, T.; Kamau, E.; Donnelly, C.; Lambert, B.; Bajaj, S.

2026-03-16 infectious diseases 10.64898/2026.03.14.26346885 medRxiv
Top 0.1%
4.3%

Serological surveys measure the presence of antibodies in a population to infer past exposure to an infectious pathogen. If study participants' ages are known, serocatalytic models can be used to retrace the historical transmission strength of a pathogen within that population, quantified by the force of infection (FOI). These models rely on age information as a key variable, since infection risks are interpreted in relation to how long individuals have been at risk. However, due to data constraints, participants' ages may be provided only within "age bins". A common approach is then to assign individuals' ages to the midpoints of their respective age bins, ignoring uncertainty in this quantity. In this study, we quantify the bias introduced by this midpoint approach and develop a Bayesian framework that explicitly accounts for uncertainty in age. By comparing inference under constant, age-dependent, and time-dependent FOI scenarios, we show that incorporating age uncertainty in serocatalytic models yields more reliable FOI estimates without sacrificing computational tractability. These improvements support the interpretation of serological data and inform public health decisions, such as estimating disease burden and identifying targeted vaccination groups.
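A minimal sketch (constant-FOI case only, uniform ages within bins assumed; the paper's framework is fully Bayesian) contrasting the two likelihood treatments: midpoint ages versus averaging the catalytic model over each age bin.

```python
import numpy as np

def p_seropos(age, foi):
    """Constant-FOI serocatalytic model: P(seropositive | age)."""
    return 1 - np.exp(-foi * age)

def loglik_midpoint(k, n, lo, hi, foi):
    """k seropositive of n sampled per age bin [lo, hi); ages set to midpoints."""
    p = p_seropos((lo + hi) / 2, foi)
    return np.sum(k * np.log(p) + (n - k) * np.log(1 - p))

def loglik_integrated(k, n, lo, hi, foi, grid=50):
    """Average the seropositivity probability over ages within each bin."""
    ages = np.linspace(lo, hi, grid)           # shape (grid, n_bins) by broadcasting
    p = p_seropos(ages, foi).mean(axis=0)
    return np.sum(k * np.log(p) + (n - k) * np.log(1 - p))
```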

9
A Machine Learning Based Causal Interface for Time-Varying Environmental Predictors of Substance Use Initiation in the ABCD Study

Wei, M.; Yadlapati, L.; Peng, Q.

2026-04-17 addiction medicine 10.64898/2026.04.15.26350988 medRxiv
Top 0.1%
3.9%

Background: The Adolescent Brain Cognitive Development (ABCD) Study provides rich longitudinal data on environmental, genetic, and behavioral factors related to substance use initiation. Classical marginal structural models (MSMs) require selecting covariates for propensity models, which is challenging when there are many correlated predictors. Methods: We analyzed longitudinal panel data from 11,868 ABCD participants with repeated observations over time. Interval-level binary outcomes were defined for initiation of alcohol, nicotine, cannabis, and any substance, including only participants at risk before initiation. All predictors were constructed as lagged variables to preserve temporal ordering. We used a two-stage machine learning-based causal framework. First, we performed graph discovery using a Granger-inspired lagged predictive modeling approach with elastic-net logistic regression to identify relationships between past predictors and future outcomes. Stable candidate edges were selected using subject-level bootstrap stability selection. Second, we estimated adjusted effects for stable predictors using double machine learning (DML) with partialling-out and cross-fitting. For each predictor, the lagged variable was treated as the exposure and adjusted for high-dimensional lagged covariates. Cross-fitting with group-based splitting accounted for within-subject dependence. Nuisance functions were estimated using random forests, and cluster-robust standard errors were used for inference. Results: We identified stable predictors across multiple domains, including sleep patterns, family environment, peer relationships, behavioral traits, and genetic risk. Many predictors were shared across substance outcomes, while some were outcome-specific. Effect sizes were modest, typically ranging from -0.01 to 0.02 per standard deviation increase in the predictor. Both risk-increasing and protective associations were observed. Risk factors included sleep disturbance and behavioral risk indicators, while protective factors included parental monitoring and structured environments. Conclusions: This study presents a practical framework for analyzing high-dimensional longitudinal data and identifying time-varying predictors of substance use initiation. The approach combines machine learning for variable selection with causal inference for effect estimation. The results highlight both shared and outcome-specific risk factors and identify modifiable targets, such as family environment and sleep, that may inform prevention strategies.
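A minimal sketch (random-forest nuisances, two-fold cross-fitting) of the partialling-out double machine learning estimator described in the Methods; the paper's group-based splitting and cluster-robust standard errors are omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_effect(y, d, X, seed=0):
    """Partialling-out DML: effect of lagged exposure d on outcome y given X."""
    ry = np.zeros_like(y, float)
    rd = np.zeros_like(d, float)
    for tr, te in KFold(2, shuffle=True, random_state=seed).split(X):
        # Residualize outcome and exposure on covariates, out of fold.
        ry[te] = y[te] - RandomForestRegressor(random_state=seed).fit(X[tr], y[tr]).predict(X[te])
        rd[te] = d[te] - RandomForestRegressor(random_state=seed).fit(X[tr], d[tr]).predict(X[te])
    theta = (rd @ ry) / (rd @ rd)                  # residual-on-residual slope
    psi = (ry - theta * rd) * rd                   # influence-function terms
    se = np.sqrt(np.mean(psi ** 2) / len(y)) / np.mean(rd ** 2)
    return theta, se
```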

10
A bootstrap particle filter for viral Rt inference and forecasting using wastewater data

Xiao, W. F.; Wang, Y.; Goel, N.; Wolfe, M.; Koelle, K.

2026-03-06 epidemiology 10.64898/2026.03.06.26347747 medRxiv
Top 0.1%
3.7%

Wastewater is increasingly being recognized as an important data stream that can contribute to infectious disease surveillance and forecasting. With this recognition, a growing number of statistical inference approaches are being developed to use wastewater data to provide quantitative insights into epidemiological dynamics. However, few existing approaches have allowed for systematic integration of data streams for inference, for example by combining case incidence data and/or serological data with wastewater data. Furthermore, only a subset of existing approaches have been able to handle missing data without imputation and to handle datasets with different sampling times or intervals. Here, we develop a statistically rigorous, yet lightweight, approach to infer and forecast time-varying effective reproduction numbers (Rt values) using longitudinal wastewater virus concentrations either alone or jointly with additional data streams including case incidence data and serological data. Our approach relies on a state-space modeling approach for inference and forecasting, within the context of a simple bootstrap particle filter. We first describe the structure of our underlying disease transmission process model as well as our observation models. Using a mock dataset, we then show that Rt can be accurately estimated by interfacing this model with case incidence data, wastewater data, or a combination of these two data streams using the bootstrap particle filter. Of note, we show that these data streams alone do not allow for reconstruction of underlying infection dynamics due to structural parameter unidentifiability. We then apply our particle filter to a previously analyzed SARS-CoV-2 dataset from Zurich that includes case data and wastewater data. Our analyses of these real-world datasets indicate that incorporation of process noise (in the form of environmental stochasticity) into the state space model greatly improves our ability to reconstruct the latent variables of the model. We further show that underlying infection dynamics can be made identifiable through the incorporation of serological data and that the bootstrap particle filter can be used to make forecasts of Rt, case incidence, and wastewater virus concentrations. We hope that the inference approach presented here will lead to greater reliance on wastewater data for disease surveillance and forecasting that will aid public health practitioners in responding to infectious disease threats.
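A generic bootstrap particle filter sketch (not the authors' transmission or observation models): the latent state is log Rt following a random walk, expected counts grow renewal-style by a factor Rt per step, and observations are negative binomial. All distributions and parameter values here are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def bootstrap_pf(obs, n_part=2000, sigma=0.1, overdisp=10.0, seed=0):
    """Filtered mean Rt from a count series, via a bootstrap particle filter."""
    rng = np.random.default_rng(seed)
    log_rt = np.zeros(n_part)                      # particles for log Rt
    incid = np.full(n_part, max(obs[0], 1.0))      # crude incidence proxy
    est = []
    for y in obs:
        log_rt += rng.normal(0, sigma, n_part)     # propagate: random walk on log Rt
        incid = np.maximum(incid * np.exp(log_rt), 1e-3)  # one-step renewal update
        # Weight particles by the NB observation likelihood (mean incid).
        w = stats.nbinom.pmf(y, overdisp, overdisp / (overdisp + incid))
        w = w / w.sum() if w.sum() > 0 else np.full(n_part, 1 / n_part)
        idx = rng.choice(n_part, n_part, p=w)      # multinomial resampling
        log_rt, incid = log_rt[idx], incid[idx]
        est.append(np.exp(log_rt).mean())
    return np.array(est)
```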

11
The Rayleigh Quotient and Contrastive Principal Component Analysis II

Jackson, K. C.; Carilli, M. T.; Pachter, L.

2026-04-10 bioinformatics 10.64898/2026.04.08.717236 medRxiv
Top 0.1%
3.7%

Contrastive principal component analysis (PCA) methods are effective approaches to dimensionality reduction where variance of a target dataset is maximized while variance of a background dataset is minimized. We previously described how contrastive PCA problems can be written as solutions to generalized eigenvalue problems that maximize particular instantiations of the Rayleigh quotient. Here, we discuss two extensions of contrastive PCA: we use kernel weighting from spatial PCA (k-ρPCA) to contrast spatial and non-spatial axes of variation, and separately solve the Rayleigh quotient in the space of basis function coefficients (f-ρPCA) to find modes of variation in functional data. Together, these extensions expand the scope of contrastive PCA while unifying disparate fields of spatial and functional methods within a single conceptual and mathematical framework. We showcase the utility of these extensions with several examples drawn from genomics, analyzing gene expression in cancer and immune response to vaccination.
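A minimal sketch of the Rayleigh-quotient view underlying both extensions: directions maximizing target variance relative to background variance are generalized eigenvectors of the two covariance matrices (the k-ρPCA and f-ρPCA variants swap in kernel-weighted and basis-coefficient matrices, which are not shown here). The ridge term is an assumption to keep the background matrix positive definite.

```python
import numpy as np
from scipy.linalg import eigh

def contrastive_pcs(target, background, n_components=2, ridge=1e-6):
    """Generalized-eigenvector solution maximizing v'Av / v'Bv."""
    A = np.cov(target, rowvar=False)                       # target covariance
    B = np.cov(background, rowvar=False) + ridge * np.eye(target.shape[1])
    vals, vecs = eigh(A, B)                                # solves A v = lambda B v
    order = np.argsort(vals)[::-1]                         # largest Rayleigh quotients first
    return vecs[:, order[:n_components]]
```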

12
Dynamic and Baseline Multi-Task Learning for Predicting Substance Use Initiation in the ABCD Study

Wei, M.; Zhang, H.; Peng, Q.

2026-04-13 addiction medicine 10.64898/2026.04.10.26350655 medRxiv
Top 0.1%
3.6%

Background: Early initiation of substance use is linked to later adverse outcomes, and risk factors come from multiple domains and are shared across substances. In our previous work, traditional time-to-event Cox models identified individual risk factors, but these models are not designed to jointly model multiple outcomes or capture complex non-linear relationships. Multi-task learning (MTL) can leverage shared structure across related outcomes to improve prediction and distinguish common versus substance-specific predictors. However, most MTL studies rely on baseline features and focus on single outcomes, which limits their ability to capture shared risk and temporal changes. Substance use initiation is a time-dependent process that unfolds during development and reflects changing exposures over time. Baseline-only models cannot capture these changes or represent risk dynamics. Discrete-time modeling provides a practical approach by estimating interval-level initiation risk and combining it into cumulative risk at the subject level. By integrating multi-task learning with dynamic modeling, it is possible to share information across outcomes while capturing how risk evolves over time, which may improve prediction performance. Methods: Using the Adolescent Brain Cognitive Development (ABCD) Study (release 5.1), we developed two complementary multi-task learning (MTL) frameworks to predict initiation of alcohol, nicotine, cannabis, and any substance use. A baseline MTL model predicted fixed-horizon (48-month) initiation using one record per participant, while a dynamic discrete-time MTL model incorporated longitudinal interval data to model time-varying risk. Both models used multi-domain environmental exposures, core covariates, and polygenic risk scores (PRS). Performance was evaluated on a held-out test set using AUROC, PR-AUC, and calibration metrics, and compared with single-task logistic regression (LR). Feature importance was assessed using permutation importance and compared with Cox proportional hazards models. Results: MTL showed comparable or improved performance relative to LR, with larger gains for low-prevalence outcomes (cannabis and nicotine). Incorporating longitudinal information led to consistent improvements across all outcomes. Dynamic models increased AUROC by +0.044 to +0.062 for MTL and +0.050 to +0.084 for LR, indicating that temporal information was the primary driver of performance gains. Feature importance analyses showed modest overlap across methods, with higher agreement between dynamic MTL and Cox models than static MTL. A small set of features, including externalizing behavior, parental monitoring, and developmental factors, were consistently identified across all approaches. Conclusions: Dynamic multi-task learning improves the prediction of substance use initiation by leveraging longitudinal structure and shared information across outcomes. While MTL provides additional gains, incorporating time-varying information is the dominant factor for improving performance. Combining baseline and dynamic frameworks offers a comprehensive strategy for identifying robust risk factors and modeling adolescent substance use initiation.
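A minimal sketch (hypothetical column names) of the discrete-time setup the Background describes: expand each subject into person-period records with an interval-level event indicator, then combine predicted interval hazards into subject-level cumulative risk.

```python
import numpy as np
import pandas as pd

def person_period(df):
    """df: one row per subject with columns 'pid', 'intervals' (number observed)
    and 'event_interval' (NaN if no initiation). Returns one row per
    subject-interval, suitable for any interval-level classifier."""
    rows = []
    for _, r in df.iterrows():
        last = int(r.event_interval) if pd.notna(r.event_interval) else int(r.intervals)
        for t in range(1, last + 1):
            rows.append({"pid": r.pid, "interval": t,
                         "event": int(t == r.event_interval)})
    return pd.DataFrame(rows)

def cumulative_risk(hazards):
    """Subject-level cumulative initiation risk from predicted interval hazards."""
    return 1 - np.prod(1 - np.asarray(hazards))
```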

13
Comparison of methods for assessing effects of risk factors on disease progression in Mendelian randomization under index event bias

Zhang, L.; Higgins, I. A.; Dai, Q.; Gkatzionis, A.; Quistrebert, J.; Bashir, N.; Dharmalingam, G.; Bhatnagar, P.; Gill, D.; Liu, Y.; Burgess, S.

2026-03-02 epidemiology 10.64898/2026.02.26.26347193 medRxiv
Top 0.1%
3.5%

Mendelian randomization has emerged as a transformative approach for inferring causal relationships between risk factors and disease outcomes. However, applying Mendelian randomization to disease progression - a critical step in validating pharmacological targets - is hampered by index event bias. This form of selection bias occurs because analyses of disease progression are necessarily restricted to individuals who have already experienced the disease event. Here, we present a comprehensive evaluation of statistical methods designed to mitigate index event bias, including inverse-probability weighting, Slope-Hunter, and multivariable methods. We compare the performance of these methods in simulations and applied examples. Inverse-probability weighting methods reduce bias, but require individual-level data and will only fully eliminate bias when the disease event model is correctly specified. Slope-Hunter performed poorly in all simulation scenarios, even when its assumptions were fully satisfied. Multivariable methods worked best when including genetic variants that affect the incident disease event. However, if these genetic variants also affect disease progression directly, then the analysis will suffer from pleiotropy. Hence, if the same biological mechanisms affect disease incidence and progression, then multivariable methods will have little utility. But in such a case, analyses of disease progression are less critical, as conclusions reached from analyses of disease incidence are likely to hold for disease progression. Our findings indicate that no single method is a universal solution to provide reliable results for the investigation of disease progression. Instead, we propose a strategic framework for method selection based on data availability and biological context.
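A minimal sketch (logistic disease-event model assumed; not the evaluated implementations) of the inverse-probability weighting idea: progression analyses, necessarily restricted to cases, are reweighted by the inverse of each case's modeled probability of experiencing the index event.

```python
import numpy as np
import statsmodels.api as sm

def ipw_progression(g, covars, case, progression):
    """g: variant dosage and covars: covariates, over all individuals;
    case: 1 if the index disease event occurred;
    progression: outcome observed among cases only (length = case.sum())."""
    X = sm.add_constant(np.column_stack([g, covars]))
    p_case = sm.Logit(case, X).fit(disp=0).predict(X)   # disease-event model
    w = 1.0 / p_case[case == 1]                         # weights for the case subset
    Xg = sm.add_constant(g[case == 1])
    fit = sm.WLS(progression, Xg, weights=w).fit()
    return fit.params[1]                                # weighted G-progression effect
```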

14
Mediation analysis in longitudinal data: an unbiased estimator for cumulative indirect effect

Li, Y.; Cabral, H.; Tripodis, Y.; Ma, J.; Levy, D.; Joehanes, R.; Liu, C.; Lee, J.

2026-04-20 epidemiology 10.64898/2026.04.18.26351189 medRxiv
Top 0.1%
3.3%

Mediation analysis quantifies how an exposure affects an outcome through an intermediate variable. We extend mediation analysis to capture the cumulative effects of longitudinal predictors on longitudinal outcomes. Our proposed model examines how mediators transmit the effects of the current and previous exposure on the current outcome. We construct a least-squares estimator for the cumulative indirect effect (CIE) and use three approaches (exact form, delta method, and bootstrap procedure) to estimate its standard error (SE). The estimator of the CIE is unbiased given no unmeasured confounding and independent model errors between the mediator model and the outcome model at all time points, as shown in statistical inference and in simulations. While the three SE estimates are numerically similar, the bootstrap procedure is recommended for its simplicity of implementation. We apply this method to the Framingham Heart Study offspring cohort to assess whether DNA methylation mediates the association of alcohol consumption with systolic blood pressure over two time points. We identify two CpGs (cg05130679 and cg05465916) as mediators and construct a composite DNA methylation score from 11 CpGs, which mediates 39% of the cumulative effect. In conclusion, we propose an unbiased estimator for the CIE. Future work will address missingness in mediators and outcomes.
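A minimal sketch (single mediator, one time point; not the full CIE estimator, which sums such products over current and lagged exposures) of a product-of-coefficients indirect effect with the recommended bootstrap SE.

```python
import numpy as np
import statsmodels.api as sm

def indirect_effect(x, m, y):
    """Product of coefficients: (exposure -> mediator) * (mediator -> outcome | exposure)."""
    a = sm.OLS(m, sm.add_constant(x)).fit().params[1]
    b = sm.OLS(y, sm.add_constant(np.column_stack([m, x]))).fit().params[1]
    return a * b

def bootstrap_se(x, m, y, reps=500, seed=0):
    """Nonparametric bootstrap SE by resampling subjects with replacement."""
    rng = np.random.default_rng(seed)
    n = len(x)
    est = [indirect_effect(x[idx], m[idx], y[idx])
           for idx in (rng.integers(0, n, n) for _ in range(reps))]
    return float(np.std(est, ddof=1))
```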

15
Generative AI-assisted Bayesian-frequentist Hybrid Inference in Single-cell RNA Sequencing Analysis for Genes Associated with Alzheimer's Disease

Han, G.; Yuan, A.; Oware, K. D.; Wright, F.; Carroll, R. J.; Smith, M.; Ory, M. G.; Yan, D.; Wang, W.; Sun, Z.; Dai, Q.; Allen, C.; Dang, A.; Liu, Y.

2026-04-20 geriatric medicine 10.64898/2026.04.17.26351142 medRxiv
Top 0.1%
2.1%

Alzheimer's disease genomics and other high-dimensional omics studies demand powerful statistical methods, yet Bayesian inference remains underutilized despite its advantages in small-sample settings, owing to the prohibitive cost of eliciting reliable priors across thousands or millions of parameters. We propose an AI-assisted Bayesian-frequentist hybrid inference framework that couples large language model-based prior elicitation with the hybrid inference theory of Yuan (2009). ChatGPT-4o is queried via a standardized prompt to assess the strength of evidence linking each gene to a disease of interest, and the response is mapped to an informative normal prior via a standardized effect-size calibration. Parameters for covariates of secondary interest are treated as frequentist parameters, preserving efficiency and avoiding sensitivity to mis-specified priors. We derive closed-form hybrid estimators under uniform and conjugate normal priors in linear models, establish their asymptotic equivalence to the frequentist and full Bayes estimators, and show in simulations that hybrid inference using unconditional variance estimation leads to high statistical power while accurately controlling the Type I error rate. Applied to single-cell RNA sequencing data from the ROSMAP cohort for Alzheimer's disease as an example, the framework identifies biologically coherent pathways (such as gamma-secretase pathways) previously undetected. The proposed framework offers a principled and computationally scalable approach to genome-wide Bayesian analysis, with potential for broad application across omics platforms and disease settings.
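A minimal sketch of the conjugate-normal update at the heart of such a hybrid scheme: the gene coefficient of interest receives an elicited informative prior while nuisance coefficients stay frequentist. The prior values below are placeholders standing in for whatever the calibrated prompt would return.

```python
def posterior_normal(beta_hat, se, prior_mean, prior_sd):
    """Posterior mean/sd for a coefficient with a N(prior_mean, prior_sd^2)
    prior and an approximately normal likelihood N(beta_hat, se^2)."""
    w = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)   # weight on the data
    mean = w * beta_hat + (1 - w) * prior_mean
    sd = (1 / (1 / prior_sd ** 2 + 1 / se ** 2)) ** 0.5
    return mean, sd

# Placeholder prior: moderately strong LLM-elicited evidence for this gene.
print(posterior_normal(beta_hat=0.8, se=0.4, prior_mean=0.5, prior_sd=0.3))
```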

16
Comparing optimal transport and machine learning approaches for database merging in scenarios involving missing data in covariates: Application to Medical Research

N'Kam Suguem, F.; Dejean, S.; Saint-Pierre, P.; Savy, N.

2026-01-26 bioinformatics 10.64898/2026.01.23.701369 medRxiv
Top 0.1%
2.1%

Motivation: One of the challenges encountered when merging heterogeneous observational clinical datasets is the recoding of categorical target variables that may have been measured differently across data sources. Standard machine learning-based approaches, such as Multiple Imputation by Chained Equations and the k-Nearest Neighbours method, are compared with an Optimal Transport-based algorithm (OTrecod) when databases are altered by missing values in covariates or by imbalanced groups. The empirical performance in these realistic data integration settings remains underexplored. Results: A comprehensive simulation study was conducted, varying sample size, group imbalance, signal-to-noise ratio, and mechanisms of missing data. The results demonstrate that OTrecod consistently achieves higher recoding accuracy than Multiple Imputation by Chained Equations and k-Nearest Neighbours, particularly in large, imbalanced and weak-signal scenarios. These findings are further illustrated using subsets of the National Child Development Study, where OTrecod and Multiple Imputation by Chained Equations minimised the distributional divergence between recoded social-class scales, while k-Nearest Neighbours produced less stable results. Availability and Implementation: The source code supporting this study is publicly available at https://github.com/FloAI/CompareOT.
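A minimal sketch (using the POT library on marginals alone, not the covariate-aware OTrecod algorithm) of the optimal-transport idea: find the cheapest coupling between the two scales' category distributions, then recode each source category to its dominant match. The marginals and cost here are assumptions for illustration.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

a = np.array([0.5, 0.3, 0.2])   # source-scale category marginals (assumed)
b = np.array([0.4, 0.4, 0.2])   # target-scale category marginals (assumed)
# Cost: distance between ordinal category positions on the two scales.
M = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)

plan = ot.emd(a, b, M)          # optimal coupling between the two codings
recode = plan.argmax(axis=1)    # map each source category to its main target
print(plan, recode)
```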

17
Nonlinear Mixed-Effects and Full Bayesian Population Pharmacokinetic Analysis of Ceftolozane-Tazobactam in Critically Ill Patients

Okunska, P.; Borys, M.; Rypulak, E.; Piwowarczyk, P.; Szczukocka, M.; Raszewski, G.; Czuczwar, M.; Wiczling, P.

2026-03-26 pharmacology and toxicology 10.64898/2026.03.24.713879 medRxiv
Top 0.1%
1.9%

Pharmacokinetic studies in critically ill patients are often constrained by small sample sizes, limiting the strength and generalizability of conclusions drawn solely from observed data. Bayesian inference offers a powerful strategy to address this challenge by incorporating prior knowledge. In this study, we evaluated two model-based approaches for characterizing the population pharmacokinetics of ceftolozane and tazobactam in critically ill patients, comparing nonlinear mixed-effects modeling with Bayesian hierarchical analyses. The Bayesian methods incorporated literature-derived prior information. Data were collected from 13 critically ill patients receiving 3.0 g of ceftolozane combined with tazobactam (2:1) via intravenous infusion. Pharmacokinetic modeling was performed using NONMEM and Stan with the Torsten extension. Model diagnostics and graphical analyses were conducted in RStudio with relevant packages. In the absence of prior information, a one-compartment model with a limited set of parameters describing inter-individual variability adequately characterized the pharmacokinetics of ceftolozane and tazobactam. When prior information was incorporated, a two-compartment model became feasible and yielded a characterization of parameter variability and correlations that was more consistent with the published literature. The application of Bayesian inference ensured alignment with the existing literature on ceftolozane and tazobactam pharmacokinetics and mitigated some systematic biases observed in the data-driven approaches. Moreover, the Bayesian approach enables direct decision-making by incorporating uncertainty into the analysis, as demonstrated by a probability of target attainment analysis. Collectively, these results underscore the utility of Bayesian methods in pharmacokinetic modeling for critically ill patients, offering a robust framework for optimizing dosing strategies in data-limited settings.
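A minimal sketch of the structural one-compartment intravenous-infusion model that sufficed without priors: plasma concentration during and after a constant-rate infusion. Parameter values below are illustrative placeholders, not the fitted estimates.

```python
import numpy as np

def conc_infusion(t, dose=2000.0, t_inf=1.0, cl=5.0, v=20.0):
    """One-compartment IV infusion: dose (mg) over t_inf hours,
    clearance cl (L/h), volume of distribution v (L); t in hours."""
    k = cl / v                       # elimination rate constant (1/h)
    r = dose / t_inf                 # infusion rate (mg/h)
    during = r / cl * (1 - np.exp(-k * t))
    after = r / cl * (1 - np.exp(-k * t_inf)) * np.exp(-k * (t - t_inf))
    return np.where(t <= t_inf, during, after)

print(conc_infusion(np.array([0.5, 1.0, 4.0, 8.0])))  # mg/L at selected times
```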

18
A comparison of observation models for statistical inference of emerging disease transmission dynamics: Application to SARS-CoV-2

Domenech de Celles, M.; Kramer, S. C.

2026-01-29 epidemiology 10.64898/2026.01.27.26344924 medRxiv
Top 0.1%
1.9%

Parameter estimation is often necessary to inform transmission models of infectious diseases. This estimation requires choosing an observation model that links the model outputs to the observed data. Although potentially consequential, this choice has received little attention in the literature. Here, we aimed to compare eight observation models, including common distributions such as the Poisson, binomial, negative binomial, and normal (equivalent to least-squares estimation). Using Bayesian inference methods, we fit an SIR-like model to daily case reports during the first wave of COVID-19 in Belgium, Finland, Germany, and the UK. We found considerable differences in the log-likelihoods of the observation models, spanning three orders of magnitude between the best and the worst. Compared with the best models, the binomial, Poisson, and normal models received no support due to their rigid variance structures. Additionally, the binomial and Poisson models produced overly narrow prediction and confidence intervals, especially for key parameters such as the basic reproduction number. The other five models, each with a free dispersion parameter scaling the variance to the mean, performed markedly better, with the negative binomial model ranking first in three countries. We conclude that flexible observation models are essential for transmission models to accurately capture all sources of uncertainty.
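A minimal sketch of the comparison being made: score the same model trajectory against observed counts under a Poisson observation model and under a negative binomial one, whose free dispersion parameter is what gives it the flexibility the abstract highlights. The counts and trajectory below are made up for illustration.

```python
import numpy as np
from scipy import stats

def loglik(obs, mu, model="nb", k=10.0):
    """Log-likelihood of observed counts given modeled means mu.
    'nb' uses a negative binomial with mean mu and dispersion k."""
    if model == "poisson":
        return stats.poisson.logpmf(obs, mu).sum()
    return stats.nbinom.logpmf(obs, k, k / (k + mu)).sum()

obs = np.array([120, 150, 210, 260, 240])
mu = np.array([110.0, 160.0, 200.0, 250.0, 255.0])   # model output (illustrative)
print(loglik(obs, mu, "poisson"), loglik(obs, mu, "nb"))
```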

19
Estimating the Smallest Worthwhile Difference (SWD) of Psychotherapy for Alcohol Use Disorder: Protocol for a Cross-Sectional Survey

Sahker, E.; Lu, I.; Eddie, D.; So, R.; Luo, Y.; Omae, K.; Tajika, A.; Angelo, J. P.; Crisp, T.; Coffin, B.; Furukawa, T. A.

2026-02-27 addiction medicine 10.64898/2026.02.16.26346220 medRxiv
Top 0.2%
1.8%

Background: Psychotherapy is proven efficacious for the treatment of alcohol use disorder (AUD). However, the patient-perceived importance of its effect is not fully appreciated in the evidence base. The smallest worthwhile difference (SWD) represents the smallest beneficial effect of an intervention that patients deem worthwhile in exchange for the harms, expenses, and inconveniences associated with the intervention, and facilitates the interpretation of the patient-perceived worthiness of an intervention. Methods: The proposed study will estimate the SWD of NIAAA-recommended psychotherapies for AUD treatment with English-speaking American respondents aged 18 and older. Primary participants will be recruited using the Prolific research crowdsourcing site. The SWD will be estimated using the Benefit-Harm Trade-off Method, presenting survey respondents with variable, hypothetical magnitudes of psychotherapy outcomes to find the smallest acceptable effect over a natural remission alternative. The overall average SWD and subgroup distributions by participant AUD treatment experiences and AUD symptomology will be described. Secondary findings will estimate the smallest recommendable risk difference for AUD psychotherapy from providers and criminal justice professionals. Expected Results: We expect to find an estimate of the SWD for AUD psychotherapy. Further, we expect that the SWD will vary between clinical subgroups based on AUD symptomology and treatment experiences. We expect differences in SWDs between the general population and those of providers and criminal justice professionals. Findings from this project will inform the treatment decision process about psychotherapy during the clinical consultation for people with AUD.

20
HORDCOIN: A Software Library for Higher Order Connected Information and Entropic Constraints Approximation

Raffaelli, G. T.; Kislinger, J.; Kroupa, T.; Hlinka, J.

2026-02-10 bioinformatics 10.64898/2026.02.08.704639 medRxiv
Top 0.2%
1.8%

Background and objective: Quantifying higher-order statistical dependencies in multivariate biomedical data is essential for understanding collective dynamics in complex systems such as neuronal populations. The connected information framework provides a principled decomposition of the total information content into contributions from interactions of increasing order. However, its application has been limited by the computational complexity of conventional maximum entropy formulations. In this work, we present a generalised formulation of connected information based on maximum entropy problems constrained by entropic quantities. Methods: The entropic-constraint approach, in contrast with the original constraints based on marginals or moments, transforms the original nonconvex optimisation into a tractable linear program defined over polymatroid cones. This simplification enables efficient, robust estimation even under undersampling conditions. Results: We present theoretical foundations, algorithmic implementation, and validation through numerical experiments and real-world data. Applications to symbolic sequences, large-scale neuronal recordings, and DNA sequences demonstrate that the proposed method accurately detects higher-order interactions and remains stable even with limited data. Conclusions: The accompanying open-source software library, HORDCOIN (Higher ORDer COnnected INformation), provides user-friendly tools for computing connected information using both marginal- and entropy-based formulations. Overall, this work bridges the gap between abstract information-theoretic measures and practical biomedical data analysis, enabling scalable investigation of higher-order dependencies in neurophysiological and other complex biological systems such as the genome.
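A minimal sketch (not the HORDCOIN entropic-constraint solver) of the quantity being decomposed: the total correlation of a discrete dataset, i.e. the entropy gap between the independent first-order maximum entropy model and the full joint distribution, which connected information then splits across interaction orders of two and higher.

```python
import numpy as np
from collections import Counter

def entropy(p):
    """Shannon entropy (bits) of a probability vector."""
    p = np.asarray([q for q in p if q > 0])
    return float(-(p * np.log2(p)).sum())

def total_correlation(samples):
    """samples: (n, d) array of discrete symbols.
    Returns sum_i H(X_i) - H(X), the sum of connected information over orders >= 2."""
    n, d = samples.shape
    h_joint = entropy(np.array(list(Counter(map(tuple, samples)).values())) / n)
    h_indep = sum(entropy(np.unique(samples[:, j], return_counts=True)[1] / n)
                  for j in range(d))
    return h_indep - h_joint
```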